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Abstract 

This article introduces an algorithm, MERGES MUFFLE, which is an extremely efficient algorithm 
to generate random permutations (or to randomly permute an existing array). It is easy to implement, 
runs in nlog 2 n + 0(1) time, is in-place, uses nlog 2 n -|- 0(n) random bits, and can be parallelized ac- 
cross any number of processes, in a shared-memory PRAM model. Finally, our preliminary simulations 
using OpenMF[^suggest it is more efficient than the Rao-Sandelius algorithm, one of the fastest existing 
random permutation algorithms. 

We also show how it is possible to further reduce the number of random bits consumed, by introdu¬ 
cing a second algorithm BalancedSMUFFLE, a variant of the Rao-Sandelius algorithm which is more 
conservative in the way it recursively partitions arrays to be shuffled. While this algorithm is of lesser 
practical interest, we believe it may be of theoretical value. 

Random permutations are a basic combinatorial object, which are useful in their own right for a lot of 
applications, but also are usually the starting point in the generation of other combinatorial objects, notably 
through bijections. 

The well-known Fisher-Yates shuffle ifTTl ITOl iterates through a sequence from the end to the beginning 
(or the other way) and for each location i, it swaps the value at i with the value at a random target location 
j at or before i. This algorithm requires very few steps — indeed a random integer and a swap at each 
iteration—and so its efficiency and simplicity have until now stood the test of time. 

But there have been two trends in trying to improve this algorithm: first, initially the algorithm assumes 
some source of randomness that allows for discrete uniform variables, but this there has been a shift towards 
measuring randomness better with the random bit model; second, with the avent of large core clusters and 
GPUs, there is an interest in making parallel versions of this algorithm. 

The random-bit model. Much research has gone into simulating probability distributions, with most 
algorithms designed using infinitely precise continuous uniform random variables (see El II.3.7]). But 
because (pseudo-)randomness on computers is typically provided as 32-bit integers—and even bypassing 
issues of true randomness and bias—this model is questionable. Indeed as these integers have a fixed 
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Algorithm 1 The classical Fisher-Yates shuffle IfTTl to generate random permutations, as per Dursten- 
feld Idol._ 

1: procedure FisherYatesShuffleCT) 

2: for i = n — 1 to 0 do 

3: j ■<— random integer from {0,..., i} 

4: SWAP(T, i, j) 

5: end for 

6 : end procedure 


precision, two questions arise: when are they not precise enough? when are they too precise? These 
are questions which are usually ignored in typical fixed-precision implementations of the aforementioned 
algorithms. And it suggests the usefulness of a model where the unit of randomness is not the uniform 
random variable, but the random bit. 

This random bit model was first suggested by Von Neumann 123, who humorously objected to the use of 
fixed-precision pseudo-random uniform variates in conjunction with transcendant functions approximated 
by truncated series. His remarks and algorithms spurred a fruitful line of theoretical research seeking to 
determine which probabilities can be simulated using only random bits (unbiased or biased? with known or 
unknown bias?), with which complexity (expected number of bits used?), and which guarantees (finite or 
infinite algorithms? exponential or heavy-tailed time distribution?). Within the context of this article, we 
will focus on designing practical algorithms using unbiased random bits. 

In 1976, Knuth and Yao IfTSlI provided a rigorous theoretical framework, which described generic op¬ 
timal algorithms able to simulate any distribution. These algorithms were generally not practically usable: 
their description was made as an infinite tree—infinite not only in the sense that the algorithm terminates 
with probability 1 (an unavoidable fact for any probability that does not have a finite binary expansion), but 
also in the sense that the description of the tree is infinite and requires an infinite precision arithmetic to 
calculate the binary expansion of the probabilities. 

In 1997, Han and Hoshi ifTTl provided the interval algorithm, which can be seen as both a generalization 
and implementation of Knuth and Yao’s model. Using a random bit stream, this algorithm amounts to 
simulating a probability p by doing a binary search in the unit interval: splitting the main interval into two 
equal subintervals and recurse into the subinterval which contains p. This approach naturally extends to 
splitting the interval in more than two subintervals, not necessarily equal. Unlike Knuth and Yao’s model, 
the interval algorithm is a concrete algorithm which can be readily programmed... as long as you have 
access to arbitrary precision arithmetic (since the interval can be split to arbitrarily small sizes). This work 
has recently been extended and generalized by Devroye and Gravel [91. 

We were introduced to this problematic through the work of Flajolet, Pelletier and Soria lfT2l on Buffon 
machines, which are a framework of probabilistic algorithms allowing to simulate a wide range of probabi¬ 
lities using only a source of random bits. 

One easy optimization of the Fisher-Yates algorithm (which we use in our simulations) is to use an 
recently discovered optimal way of drawing discrete uniform variables |[T9l . 

Prior Work in Parallelization. There has been in particular a great deal of interest in finding efficient 
parallel algorithms to randomly generate permutations, in various many contexts of parallelization, some 
theoretical and some practical |[T4l[T5ll2^[I^ [ni5ll6ll2ll. 

Most recently. Shun et al. 1(2^ wrote an enlightening article, in which they looked at the intrinsic pa- 
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rallelism inherent in classical sequential algorithms, and these can be broken down into independent parts 
which may be executed separately. One of the algorithms they studied is the Fisher-Yates shuffle. They 
considered the insertion of each element of the algorithm as a separate part, and showed that the dependency 
graph, which provides the order in which the parts must be executed, is a random binary search tree, and 
as such, is well known to have on average a logarithmic height [8]. This allowed them to show that the 
algorithm could be distributed on n/ log n processors. 

Because they aimed for generality (and designed a framework to adapt other similar sequential algo¬ 
rithms), their resulting algorithm is not as optimized as can be. 

We believe our contribution improves on this work by providing a parallel algorithm with similar gua¬ 
rantees, and which runs, in practice, extremely fast. 


Algorithm 2 The MERGESHUFFLE algorithm. 

1: procedure MergeShuffle(T, k) > kis the cut-off threshold at which to shuffle with Fisher-Yates. 
2: Divide T into 2^ blocks of roughly the same size 

3: Shuffle each block independently using the Fisher-Yates shuffle 

4: p 

5: repeat 

6: Use the Merge procedure to merge adjacent blocks of size 2 p into new blocks of size 2^+^ 

7: p ^ p -h 1 

8: until T consists of a single block 

9: end procedure 


Splitting Processes. Relatively recently, Flajolet et al. |[T^ formulated an elegant random permutation 
algorithm which uses only random bits, using the trie data structure, which models a splitting process: 
associate to each element of a set x G S' an infinite random binary word Wx, and then insert the key-value 
pairs {wx,x) into the trie; the ordering provided by the leaves is then a random permutation. 

This general concept is elegant, and it is optimized in two ways: 

• the binary words thus do not need to be infinite, but only long enough to completely distinguish the 
elements; 

• the binary words do not need to be drawn a priori, but may be drawn one bit (at each level of the trie) 
at a time, until each element is in a leaf of its own. 

This algorithm turns out to have been already exposed in some form in the early 60’s, independently by 
Rao II 20 II and by Sandelius |22|. Their generalization extends to the case where we split the set into R 
subsets (and where we would then draw random integers instead of random bits), but in practice the case 
i? = 2 is the most efficient. The interest of this algorithm is that it is, as far as we know, the first example of 
a random permutation algorithm which was written to be parallelized. 

1 The MergeShuffle algorithm 

The new algorithm which is the central focus of this paper was designed by progressively optimizing a 
splitting-type idea for generating random permutation which we discovered in Flajolet et al. IfT^ . The 
resulting algorithm closely mimics the structure and behavior of the beloved MERGESORT algorithm. It 


3 






gets the same guarantees as this sorting algorithm, in particular with respect to running time and being 
in-place. 

To optimize the execution of this algorithm, we also set a cut-off threshold, a size below which permu¬ 
tations are shuffled using the Fisher-Yates shuffle instead of increasingly smaller recursive calls. This is an 
optimization similar in spirit to that of MERGESORT, in which an auxiliary sorting algorithm is used on 
small instances. 

1.1 In-Place Shuffled Merging 

The following algorithm is the linchpin of the MergeShuffle algorithm. It is a procedure that takes two 
arrays (or rather, two adjacent ranges of an array T), both of which are assumed to be randomly shuffled, 
and produces a shuffled union. 

Importantly, this algorithm uses very few bits. Assuming a two equal-sized sub-arrays of size k each, 
the algorithm requires 2k + Q{Vk log k) random bits, and is extremely efficient in time because it requires 
no auxiliary space. (We show an a 

Algorithm 3 In-place shuffled merging of two random sub-arrays. 

1: procedure Merge(T, s, ni, 71 - 2 ) 

2 : i s > i, j,n are the beginning, middle, and end position considered in the array. 

3: j <— s -|- ni 

4: n s + ni + n 2 

5: loop 

6: if Flip() = 0 then > Flip a coin to determine which sub-array to take an element from. 

7: if z = j then break 

8 : else 

9: if j = n then break 

10: SWAP(r, i, j) 

11 : j^j + 1 

12: end if 

13: i i — i -f- 1 

14: end loop 

15: while z < n do t> One list is depleted; use Fisher-Yates to finish merging. 

16: Draw a random integer m G {s,..., z} 

17: Swap(T, z, m) 

18: i ^ i + 1 

19: end while 

20 : end procedure 


Lemma 1. Let A and B be two randomly shuffled arrays, respectively of sizes ni and 02 - Then the procedure 
Merge produces a randomly shuffled union C of these arrays, of size n = ni + n 2 - 

Proof For every integer /c ^ 0, let Aj. be the event that, after the execution of the first loop (lines 5 to 14) 
of the procedure Merge, k elements of the list A remain (j = n and i = n — k). Similarly, let be the 
event that k elements of the list B remain (i = j = n — k). We prove that, conditionally to every A^ and 
Bk, the array is randomly shuffled after the procedure. We can then conclude from Bayes’s theorem shows 
that this is also true unconditionally. 
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Let A: ^ 0 and condition by the event (the case of is identical). After the execution of the first 
loop, the n — k first elements of the array consist of: m — k elements of A; and all n 2 elements of B. Let 
w be the word composed of the n — k + 1 random bits drawn by the first loop. The word w ends with a 1 
(this bit corresponds to picking an element from B, which, at that point is depleted, causing the loop to be 
exited). Among the remaining n — k bits, ni — A: are O’s and n 2 are I’s, and for 0 ^ i < n — A:, the element 
C[i] is from A if ruj = 0 and from B otherwise. Since A and B are randomly shuffled and since all words w 
are drawn wifh equal probabilify, fhis implies fhaf fhe firsl n — k elemenfs are randomly shuffled. 

Finally, we use fhe following loop invarianf, which is fhe same loop invarianf as in fhe proof of fhe 
Fisher-Yales algorilhm: afler every execution of fhe second loop (lines 15 fo 19), fhe firsl i elemenfs of fhe 
array are randomly shuffled. This shows lhal fhe array is randomly shuffled afler fhe whole procedure. □ 


Lemma 2. The procedure Merge produces a shuffled array C of size n using n + &{^/nlogn) random 
bits. 

Proof The number of random bils used depends again, on fhe size m of fhe word w drawn during fhe 
firsl loop (lines 5 lo 14). Indeed for m — 1 of fhe elemenfs of C, we will have shuffled Ihem using only a 
single bil; for fhe remaining 2A: — m + 1 elemenfs, we musl inserl Ihem in C by drawing random inleger of 
increasing range m, ...,2k. 

The number of random bils used only depends on fhe number of limes fhe firsl loop (lines 5 lo 14) is 
execuled. Indeed for m — 1 of fhe elemenfs of C, we will have shuffled Ihem using only a single bil, and 
for fhe remaining n — m + 1 elemenfs, we musl inserl Ihem in C by drawing random inleger of increasing 
range m, ...,n. The overall average number of random bils used is 


m 


+ \^og2k]. 


k=m 


The firsl m bils are used during fhe firsl loop and fhe resl are used lo draw discrele uniform laws during fhe 
second loop. 

The firsl loop slops eilher because we have drawn ni + 1 O’s or n 2 + 1 I’s (whichever occurs firsl). In 
Ihe firsl case, Ihe average number of random bils used is Ihus 

In Ihis expression i represenls Ihe number of I’s lhal were drawn before Ihe (ni + 1)*^ 0 was drawn. 
Similarly we oblain Ihe following expression for Ihe second case 

i=0 ^ ^ \ k=n 2 +l+i ) 

The sum of Ihose Iwo expressions gives Ihe average number of random bils used by Ihe algorilhm. By 
using Ihe following upper and lower bound 

n 

log 2 (m)(n — m + 1) < riog 2 A:] < log 2 (n)(n — m + 1) 


k=m 


we oblain Ihe following asymptotic behaviour for Ihe average number of random bils used 

n + 0(\/nlogn). 


□ 
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1.2 Average number of random bits of Merge Shuffle 

We now give an estimate of the average number of random bits used by our algorithm to sample a random 
permutation of size re. Let cost{k) denote the average number of random bits used by a merge operation 
with an output of size k. For the sake of simplicity, assume that we sample a random permutation of size 
re = 2'". The average number of random bits used is then 

m 

i=l 

We have seen that cost{k) = k + Q{Vk log k). Thus, the average number of random bits used to sample 
a random permutation of size re = 2”^ is 

m / m . 

^ 2”^-* ( 2 * + 0 log(2*)) ) = rre2™ + © 2"* ^ ^ 

i=l V i=l ^ 

which finally yields 

rre2™' + 0(2”^) = relog 2 re + 0(re). 

2 BalancedShuffle 

For theoretical value, we also present a second algorithm, which introduces an optimization which be believe 
has some worth. 


Algorithm 4: The BalancedShuffle algorithm. 

Input: an array T 
Result: T is randomly shuffled 
Main Function BalancedShuffle(T) 
n = length(T) 
if n > 1 then 

BalancedShuffle (T[0:|]) 

BalancedShuffle CT[|:n|) 

BalancedMerge(T) 
end 

Procedure BalancedMerge (T) 

n = length(T) 

w = uniformly sampled random balanced word of size n 
j = n/2 

for k = 0 to n — I do 
if iu[k] = 1 then 

I swap elements at positions i and j in T 
j = j +1 
end 
i = 

end 


2.1 Balanced Word 

Inspired by Remy |[^ ’s now classical and efficient algorithm to generate random binary of exact size from 
the repeated drawing of random integers, Bacher et al. [3i| produced a more efficient version that uses, on 
average 2k + 0((log kY). Binary trees, which are enumerated by the Catalan numbers GSTl . are in bijection 
with Dyck words, which are balanced words containing as many O’s as I’s. So Bacher et al.’s random tree 
generation algorithm can be used to produce a balanced word of size 2k using very few extra bits. 
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Rationale. The idea behind using a balanced word is that it is more efficient, in average number of bits. 

Indeed, splitting processes (repeatedly randomly partition n elements until each is in its own partition), 
are well known to require n log 2 n + 0{n) bits on average — this is the path length of a random trie |[T3l . 
The linear term comes from the fact that when processes are partitioned in two subsets, these subsets are 
not of equal size (which would be the optimal case), but can be very unbalanced; furthermore, with small 
probability, it is possible that all elements remain in the same set, especially in the lower levels. 

On the other hand, if we are able to partition the elements into two equal-sized subsets, we should be 
able to circumvent this issue. This idea is useful here, and we believe, would be useful in other contexts as 
well. 

Disadvantages. The advantage is that using balanced words allows to for a more efficient and sparing use 
of random bits (and since random bits cost time to generate, this eventually translates to savings in running 
time). However this requires a linear amount of auxiliary space; for this reason, our BalancedShuffled 
algorithm is generally slower than the other, in-place algorithms. 

2.2 Correctness 

Denote by Sn the symmetric group containing all permutations of size n. Let C{n, k) be the set of all words 
of length n on the alphabet {0, 1 } containing k O’s and n — k I’s. We have \Sn\ = nl and C{n, k) = (^). 
We first prove the following lemma: 

Lemma 3. Assume we have a list of n elements and a list of m other elements. Shuffle both of them 
uniformly at random independently (i.e. sample an element in Sn and an element in Sm independently and 
uniformly at random). Now sample a word in C{n + m, n) uniformly at random. With the process from 
Bacher et al. we obtain a list of size n + m that is a uniformly sampled random permutation of the 
n + m elements. 

Proof. We have defined a function 

F . Sn X Sni X C{jlTTl^Tl) )■ Sn-\-m' (1) 

F is a surjection, because any given permutation of the n+m elements can be obtained by choosing adequate 
permutations of size n and m, as well as an adequate word in C(n -|- m, n). Moreover, we have 

^ ^ = {n + m)\ = \Sn+rn\ (2) 

n J 

where | • | denotes the cardinality of a set. This implies that F is actually bijective. Thus, any element of 
Sn+m has the same probability of occurring, as it is obtained by a unique element of Sn x Sm x C'(n-|-m, n). 
[end of proof] 

If we use the “bottom-up” approach (we start with lists containing only one element and work our way 
up), it follows by induction that the final list is indeed a uniformly sampled random permutation. 

If we use the “top-down” approach, the starting list is a uniformly sampled random permutation of the 
final lisf, thus the final list is a uniformly sampled random permutation of the starting list (the inverse of a 
uniformly sampled random permutation is still a uniformly sampled random permutation). □ 


| 5 'n X Sm X C{n + m.,n)\ = n\m\ 


1 


2.3 Average number of random bits 

We now give an estimate of the average number of random bits used by our algorithm to sample a random 
permutation of size n. Let cost{2k) denote the average number of random bits used to sample a random 
balanced word of length 2k (an element of C{2k, k)). For the sake of simplicity, assume that we sample a 
random permutation of size n = 2™. The average number of random bits used is then 

m 

^2”^-*cosf(2*). 
i=l 

For the algorithm we have cost{2k) = 2k + 0(log^ k) 
used to sample a random permutation of size n = 2™ is 

m 

2™-* (2* + 0 (log2(2*-^))) = m2™ + 0 

i=l 

which can be rewritten as 

nlog 2 n + 0(n). 

3 Simulations 

The simulations were run on a computing cluster with 40 cores. The algorithms were implemented in C, 
and their parallel versions were implemented using the OpenMP library, and delegating the distribution of 
the threading entirely to it. 

Our algorithm, MERGES HUFFLE, 


O. Thus, the average number of random bits 


m .2 \ 

2™ ^ ^ = m2™ + 0(2™) 

. i=i ^7 



Figure 1. Running times of several random permutation algorithms. Fisher-Yates shuffle, while extremely 
fast, gets slowed down once permutations are very large. Our parallel MergeShuffle algorithm is consis¬ 
tently faster than all algorithms, although the lead is not yet much compare to the Rao-Sandelius algorithm. 
















n 

10® 

10® 

10^ 

10® 

Fisher-Yates 

1631434 

19 550941 

229 329 728 

2 628 248 831 

Merges HUFFLE 

1636560 

19 686051 

231641075 

2 650387 993 

Rao-Sandelius 

1631519 

19 550449 

229 327 120 

2 628 251036 

BalancedShuffle 

1 889034 

22 046574 




Table 1. Average number of random bits used by our implementation of various random permutation al¬ 
gorithms over 100 trials. (The current implementation of BalancedShuffle were in Python rather than 
C, and are prohibitively slow on larger permutations, but preliminary results show that it converges to an 
improved number of random bits.) 
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A Code Listing for the MergeSort algorithm 


We reproduce here the most part of our algorithm, with some OpenMP QIH hints. The full code can be 
obtained at https : //github . com/axel-bacher/mergeshuf f le 

A.l The merge procedure 

// merge together two lists of size m and n-m 

void merge (unsigned int *t, unsigned int m, unsigned int n) { 
unsigned int *u = t; 
unsigned int *v = t + m; 
unsigned int *w = t + n; 

// randomly take elements of the first and second list according to flips 

while(1) { 

if( random_bit ()) { 

if (v == w) break; 

swap(u, V t+); 

} else 

if (u == v) break; 

u +t; 

} 

// now one list is exhausted, use Fisher-Yates to finish merging 
while (u < w) { 

unsigned int i = random_int(u - t + 1) ; 
swap(t + i, u t+); 

} 

} 

A.2 The MergeSort algorithm itself 

extern unsigned long cutoff; 

void shuffle (unsigned int *t, unsigned int n) { 

// select q = 2''c such that n/q <= cutoff 

unsigned int c = 0; 

while( (n >> c) > cutoff) c ++; 

unsigned int q = 1 << c; 

unsigned long nn = n; 

// divide the input in q chunks, use Fisher-Yates to shuffle them 
#pragma omp parallel for 

for(unsigned int 1=0; i < q; i ++) { 

unsigned long j = nn * i >> c; 


11 


} 


unsigned long k = nn * (i+1) >> c; 
fisher_yates(t + j, k - j); 


for(unsigned int p = 1; p < q; p += p) { 

// merge together the chunks in pairs 
#pragma omp parallel for 

for(unsigned int 1=0; i < q; i += 2*p) { 

unsigned long j = nn * i >> c; 

unsigned long k = nn * (i + p) >> c; 

unsigned long 1 = nn * (i + 2*p) >> c; 

merge(t+j, k-j, 1-j); 

} 

} 

} 
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